
    Automatic de-identification of textual documents in the electronic health record: a review of recent research

    Background: In the United States, the Health Insurance Portability and Accountability Act (HIPAA) protects the confidentiality of patient data and requires the informed consent of the patient and approval of the Institutional Review Board to use data for research purposes, but these requirements can be waived if data is de-identified. For clinical data to be considered de-identified, the HIPAA "Safe Harbor" technique requires 18 data elements (called PHI: Protected Health Information) to be removed. The de-identification of narrative text documents is often realized manually and requires significant resources. Well aware of these issues, several authors have investigated automated de-identification of narrative text documents from the electronic health record, and a review of recent research in this domain is presented here. Methods: This review focuses on recently published research (after 1995), and includes relevant publications from bibliographic queries in PubMed, conference proceedings, the ACM Digital Library, and interesting publications referenced in already included papers. Results: The literature search returned more than 200 publications. The majority focused only on structured data de-identification instead of narrative text, on image de-identification, or described manual de-identification, and were therefore excluded. Finally, 18 publications describing automated text de-identification were selected for detailed analysis of the architecture and methods used, the types of PHI detected and removed, the external resources used, and the types of clinical documents targeted. All text de-identification systems aimed to identify and remove person names, and many included other types of PHI. Most systems used only one or two specific clinical document types, and were mostly based on two different groups of methodologies: pattern matching and machine learning. Many systems combined both approaches for different types of PHI, but the majority relied only on pattern matching, rules, and dictionaries. Conclusions: In general, methods based on dictionaries performed better with PHI that is rarely mentioned in clinical text, but are more difficult to generalize. Methods based on machine learning tend to perform better, especially with PHI that is not mentioned in the dictionaries used. Finally, the issues of anonymization, sufficient performance, and "over-scrubbing" are discussed in this publication.
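
    As a rough illustration of the pattern-matching family of methods the review covers, the sketch below scrubs a few easily patterned PHI types with regular expressions. The patterns and the MRN format are illustrative assumptions, not taken from any reviewed system; person names, the PHI type all systems target, generally require dictionaries or machine learning rather than regexes alone.

        import re

        # Illustrative patterns for a few easily regex-able PHI types;
        # real systems combine many more patterns, dictionaries, and ML models.
        PHI_PATTERNS = {
            "DATE": re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),
            "PHONE": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
            "MRN": re.compile(r"\bMRN[:\s]*\d{6,8}\b"),  # hypothetical MRN format
        }

        def scrub(text):
            """Replace each matched PHI span with a bracketed category tag."""
            for label, pattern in PHI_PATTERNS.items():
                text = pattern.sub("[" + label + "]", text)
            return text

        print(scrub("Seen 03/14/2009, MRN: 1234567, call 555-867-5309."))
        # -> Seen [DATE], [MRN], call [PHONE].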

    Actuation of Micro-Optomechanical Systems Via Cavity-Enhanced Optical Dipole Forces

    We demonstrate a new type of optomechanical system employing a movable, micron-scale waveguide evanescently coupled to a high-Q optical microresonator. Micron-scale displacements of the waveguide are observed for milliwatt-level (mW) optical input powers. Measurement of the spatial variation of the force on the waveguide indicates that it arises from a cavity-enhanced optical dipole force due to the stored optical field of the resonator. This force is used to realize an all-optical tunable filter operating with sub-mW control power. A theoretical model of the system shows the maximum achievable force to be independent of the intrinsic Q of the optical resonator and to scale inversely with the cavity mode volume, suggesting that such forces may become even more effective as devices approach the nanoscale. Comment: 4 pages, 5 figures. High resolution version available at (http://copilot.caltech.edu/publications/CEODF_hires.pdf). For associated movie, see (http://copilot.caltech.edu/research/optical_forces/index.htm).
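
    For orientation only (the paper's own model should be consulted for its exact result), a standard textbook form of the dispersive optical force in such cavity systems, written in LaTeX, is:

        % Force from the dispersive shift of the cavity resonance frequency
        % \omega_c with waveguide position x, for stored optical energy U:
        F = -\frac{U}{\omega_c}\,\frac{d\omega_c}{dx}
        % On resonance, U is set by the dropped power and photon lifetime:
        U = P_d \, \tau, \qquad \tau = \frac{Q_L}{\omega_c}

    Qualitatively, shrinking the mode volume concentrates the stored field near the waveguide and increases the per-photon coupling d\omega_c/dx, consistent with the abstract's claim that the achievable force scales inversely with mode volume.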

    Clinical narrative analytics challenges

    Precision medicine, or evidence-based medicine, is based on the extraction of knowledge from medical records to provide individuals with the appropriate treatment at the appropriate moment according to the patient's features. Despite the efforts to use clinical narratives for clinical decision support, many challenges still have to be faced today, such as multilinguality, diversity of terms and formats across services, acronyms, and negation, to name but a few. The same problems arise when analyzing narratives in the literature, whose analysis would provide physicians and researchers with useful highlights. In this talk we will analyze challenges, solutions, and open problems, and will review several frameworks and tools that are able to perform NLP over free text to extract medical entities by means of a Named Entity Recognition (NER) process. We will also present a framework we have developed to extract and validate medical terms. In particular, we present two use cases: (i) extraction of medical entities from a set of infectious disease description texts provided by MedlinePlus, and (ii) identification of stroke scales in clinical narratives written in Spanish.
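
    As a minimal sketch of the NER step described in the talk (not the authors' framework), the snippet below runs a generic spaCy pipeline over free text; the model name is a placeholder assumption, and a clinical or Spanish-language model would be substituted in practice.

        import spacy

        # Placeholder general-purpose English model; clinical or Spanish
        # models would replace it for the use cases described in the talk.
        nlp = spacy.load("en_core_web_sm")

        doc = nlp("Patient admitted on 12 March with suspected Lyme disease.")
        for ent in doc.ents:
            # Each recognized entity span carries its text and predicted label.
            print(ent.text, ent.label_)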

    Developing a manually annotated clinical document corpus to identify phenotypic information for inflammatory bowel disease

    Background: Natural Language Processing (NLP) systems can be used for specific Information Extraction (IE) tasks such as extracting phenotypic data from the electronic medical record (EMR). These data are useful for translational research and are often found only in free-text clinical notes. A key required step for IE is the manual annotation of clinical corpora and the creation of a reference standard for (1) training and validation tasks and (2) focusing and clarifying NLP system requirements. These tasks are time consuming, expensive, and require considerable effort on the part of human reviewers. Methods: Using a set of clinical documents from the VA EMR for a particular use case of interest, we identify specific challenges and present several opportunities for annotation tasks. We demonstrate specific methods using an open-source annotation tool, a customized annotation schema, and a corpus of clinical documents for patients known to have a diagnosis of Inflammatory Bowel Disease (IBD). We report clinician annotator agreement at the document, concept, and concept attribute level. We estimate concept yield in terms of annotated concepts within specific note sections and document types. Results: Annotator agreement at the document level for documents that contained concepts of interest for IBD, using the estimated Kappa statistic (95% CI), was very high at 0.87 (0.82, 0.93). At the concept level, F-measure ranged from 0.61 to 0.83. However, agreement varied greatly at the specific concept attribute level. For this particular use case (IBD), the clinical documents producing the highest concept yield per document included GI clinic notes and primary care notes. Within the various types of notes, the highest concept yield was in sections representing patient assessment and history of presenting illness. Ancillary service documents and family history and plan note sections produced the lowest concept yield. Conclusion: Challenges include defining and building appropriate annotation schemas, adequately training clinician annotators, and determining the appropriate level of information to be annotated. Opportunities include narrowing the focus of information extraction to use-case-specific note types and sections, especially in cases where NLP systems will be used to extract information from large repositories of electronic clinical note documents.
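
    The agreement statistics reported above are standard; a minimal sketch computing them with scikit-learn on invented toy labels (not the study's data):

        from sklearn.metrics import cohen_kappa_score, f1_score

        # Toy document-level judgments from two annotators:
        # 1 = document contains an IBD concept of interest, 0 = it does not.
        annotator_a = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
        annotator_b = [1, 0, 1, 0, 0, 1, 0, 1, 1, 1]

        print("kappa:", cohen_kappa_score(annotator_a, annotator_b))
        # Treating annotator A as the reference standard:
        print("F1:", f1_score(annotator_a, annotator_b))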

    Optimising medication data collection in a large-scale clinical trial

    © 2019 Lockery et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. Objective: Pharmaceuticals play an important role in clinical care. However, in community-based research, medication data are commonly collected as unstructured free text, which is prohibitively expensive to code for large-scale studies. The ASPirin in Reducing Events in the Elderly (ASPREE) study developed a two-pronged framework to collect structured medication data for 19,114 individuals. ASPREE provides an opportunity to determine whether medication data can be cost-effectively collected and coded, en masse, from the community using this framework. Methods: The ASPREE framework of a type-to-search box with automated coding and linked free-text entry was compared to the traditional method of free-text-only collection and post hoc coding. Reported medications were classified according to their method of collection and analysed by Anatomical Therapeutic Chemical (ATC) group. The relative cost of collecting medications was determined by calculating the time required for database set-up and medication coding. Results: Overall, 122,910 participant structured medication reports were entered using the type-to-search box and 5,983 were entered as free text. Free-text data contributed 211 unique medications not present in the type-to-search box. Spelling errors and unnecessary provision of additional information were among the top reasons why medications were reported as free text. The cost per medication using the ASPREE method was approximately USD 0.03, compared with USD 0.20 per medication for the traditional method. Conclusion: Implementation of this two-pronged framework is a cost-effective alternative to free-text-only data collection in community-based research. The higher initial set-up costs of this combined method are justified by long-term cost effectiveness and the scientific potential for analysis and discovery gained through the collection of detailed, structured medication data.
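
    A minimal sketch of the two-pronged capture logic described above; the mini ATC dictionary and the exact-match rule are invented for illustration, and the real type-to-search box and coding dictionary are far larger.

        # Hypothetical mini dictionary mapping searchable names to ATC codes.
        ATC_LOOKUP = {
            "aspirin": "B01AC06",
            "atorvastatin": "C10AA05",
            "metformin": "A10BA02",
        }

        def record_medication(entry):
            """Return a structured report when the entry matches the
            type-to-search dictionary; otherwise fall back to free text
            for post hoc coding."""
            key = entry.strip().lower()
            if key in ATC_LOOKUP:
                return {"medication": key, "atc": ATC_LOOKUP[key], "source": "structured"}
            return {"medication": entry, "atc": None, "source": "free_text"}

        print(record_medication("Aspirin"))
        print(record_medication("asprin 100mg daily"))  # misspelling -> free text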

    Assessing the accuracy of an inter-institutional automated patient-specific health problem list

    Background: Health problem lists are a key component of electronic health records and are instrumental in the development of decision-support systems that encourage best practices and optimal patient safety. Most health problem lists require initial clinical information to be entered manually and few integrate information across care providers and institutions. This study assesses the accuracy of a novel approach to create an inter-institutional automated health problem list in a computerized medical record (MOXXI) that integrates three sources of information for an individual patient: diagnostic codes from medical services claims from all treating physicians, therapeutic indications from electronic prescriptions, and single-indication drugs. Methods: Data for this study were obtained from 121 general practitioners and all medical services provided for 22,248 of their patients. At the opening of a patient's file, all health problems detected through medical service utilization or single-indication drug use were flagged to the physician in the MOXXI system. Each newly arising health problem was presented as 'potential', and physicians were prompted to specify whether the health problem was valid (Y) or not (N), or whether they preferred to reassess its validity at a later time. Results: A total of 263,527 health problems, representing 891 unique problems, were identified for the group of 22,248 patients. Medical services claims contributed the majority of problems identified (77%), followed by therapeutic indications from electronic prescriptions (14%) and single-indication drugs (9%). Physicians actively chose to assess 41.7% (n = 106,950) of health problems. Overall, 73% of the problems assessed were considered valid; 42% originated from medical service diagnostic codes, 11% from single-indication drugs, and 47% from prescription indications. Twelve percent of problems identified through other treating physicians were considered valid, compared to 28% identified through study physician claims. Conclusion: Automation of an inter-institutional problem list added over half of all validated problems to the health problem list, of which 12% were generated by conditions treated by other physicians. Automating the integration of existing information sources provides timely access to accurate and relevant health problem information. It may also accelerate the uptake and use of electronic medical record systems.
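
    A schematic sketch of how the three evidence sources might be merged into a queue of 'potential' problems with provenance for physician review; the field names and codes are invented placeholders, not MOXXI's actual data model.

        # Each source proposes (patient_id, problem_code) pairs; invented examples.
        claims_dx      = {("p1", "I10"), ("p1", "E11")}  # claims diagnostic codes
        rx_indications = {("p1", "E11"), ("p1", "J45")}  # e-prescription indications
        single_use_rx  = {("p1", "J45")}                 # single-indication drugs

        def potential_problems(sources):
            """Union the sources, keeping provenance, for Y/N/later review."""
            merged = {}
            for name, pairs in sources.items():
                for patient, code in pairs:
                    merged.setdefault((patient, code), set()).add(name)
            return merged

        merged = potential_problems({"claims": claims_dx,
                                     "rx_indication": rx_indications,
                                     "single_indication_drug": single_use_rx})
        for (patient, code), provenance in merged.items():
            print(patient, code, sorted(provenance), "-> status: potential")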

    Measuring diversity in medical reports based on categorized attributes and international classification systems

    Background: Narrative medical reports do not use standardized terminology and often provide insufficient information for statistical processing and medical decision making. The objectives of the paper are to propose a method for measuring diversity in medical reports written in any language, to compare diversities in narrative and structured medical reports, and to map attributes and terms to selected classification systems. Methods: A new method based on a general concept of f-diversity is proposed for measuring the diversity of medical reports in any language. The method is based on categorized attributes recorded in narrative or structured medical reports and on international classification systems. Values of categories are expressed by terms. Using SNOMED CT and ICD-10, we map attributes and terms to predefined codes. We use f-diversities of the Gini-Simpson and Number of Categories types to compare diversities of narrative and structured medical reports. The comparison is based on attributes selected from the Minimal Data Model for Cardiology (MDMC). Results: We compared diversities of 110 Czech narrative medical reports and 1,119 Czech structured medical reports. Selected categorized attributes of MDMC mostly had different numbers of categories and used different terms in narrative and structured reports. We found more than 60% of MDMC attributes in SNOMED CT. We showed that attributes in narrative medical reports had greater diversity than the same attributes in structured medical reports. Further, we replaced each value of a category (term) used for attributes in narrative medical reports by the closest term and category used in MDMC for structured medical reports. We found that relative Gini-Simpson diversities in structured medical reports were significantly smaller than those in narrative medical reports, except for the "Allergy" attribute. Conclusions: Terminology in narrative medical reports is not standardized. Therefore, it is nearly impossible to map values of attributes (terms) to codes of known classification systems. High diversity in narrative medical report terminology makes computer processing more difficult than for structured medical reports, and some information may be lost during this process. Setting a standardized terminology would help healthcare providers to have complete and easily accessible information about patients, which would result in better healthcare.
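
    The Gini-Simpson index used above is standard (one minus the sum of squared category proportions); a minimal sketch computing it, alongside the Number of Categories measure, on invented toy terms:

        from collections import Counter

        def gini_simpson(terms):
            """1 - sum(p_i^2): chance two random reports use different terms."""
            counts = Counter(terms)
            total = sum(counts.values())
            return 1.0 - sum((n / total) ** 2 for n in counts.values())

        # Toy smoking-status terms as they might appear in the two report types.
        narrative  = ["smoker", "non-smoker", "ex-smoker", "stopped smoking", "smokes"]
        structured = ["smoker", "non-smoker", "ex-smoker", "smoker", "non-smoker"]

        for name, terms in [("narrative", narrative), ("structured", structured)]:
            print(name, "categories:", len(set(terms)),
                  "Gini-Simpson:", round(gini_simpson(terms), 2))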

    Data-driven approach for creating synthetic electronic medical records

    Background: New algorithms for disease outbreak detection are being developed to take advantage of full electronic medical records (EMRs) that contain a wealth of patient information. However, due to privacy concerns, even anonymized EMRs cannot be shared among researchers, resulting in great difficulty in comparing the effectiveness of these algorithms. To bridge the gap between novel bio-surveillance algorithms operating on full EMRs and the lack of non-identifiable EMR data, a method for generating complete and synthetic EMRs was developed. Methods: This paper describes a novel methodology for generating complete synthetic EMRs both for an outbreak illness of interest (tularemia) and for background records. The method developed has three major steps: 1) synthetic patient identity and basic information generation; 2) identification of care patterns that the synthetic patients would receive based on the information present in real EMR data for similar health problems; 3) adaptation of these care patterns to the synthetic patient population. Results: We generated EMRs, including visit records, clinical activity, laboratory orders/results, and radiology orders/results, for 203 synthetic tularemia outbreak patients. Validation of the records by a medical expert revealed problems in 19% of the records; these were subsequently corrected. We also generated background EMRs for over 3,000 patients in the 4-11 year age group. Validation of those records by a medical expert revealed problems in fewer than 3% of these background patient EMRs, and the errors were subsequently rectified. Conclusions: A data-driven method was developed for generating fully synthetic EMRs. The method is general and can be applied to any data set that has similar data elements (such as laboratory and radiology orders and results, clinical activity, and prescription orders). The pilot synthetic outbreak records were for tularemia, but our approach may be adapted to other infectious diseases. The pilot synthetic background records were for the 4-11 year age group. The adaptations that must be made to the algorithms to produce synthetic background EMRs for other age groups are indicated.
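
    A skeletal sketch of the three-step generation flow described above (identity generation, care-pattern identification, adaptation); the fields and care patterns are invented placeholders, not the paper's mined patterns.

        import random

        random.seed(7)  # reproducible toy output

        # Step 1: synthetic patient identity and basic information.
        def make_patient(pid):
            return {"id": pid,
                    "age": random.randint(4, 11),
                    "sex": random.choice(["F", "M"])}

        # Step 2: care patterns that would be mined from real EMR data for
        # similar health problems; hard-coded placeholders here.
        CARE_PATTERNS = [
            ["clinic_visit", "cbc_order", "cbc_result"],
            ["clinic_visit", "chest_xray_order", "chest_xray_result"],
        ]

        # Step 3: adapt a care pattern to each synthetic patient.
        def make_record(pid):
            patient = make_patient(pid)
            patient["events"] = random.choice(CARE_PATTERNS)
            return patient

        for pid in range(3):
            print(make_record(pid))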